iscream

Reading WGBS data for analysis

James Eapen, Jacob Morrison, Hui Shen

July 25, 2024

Whole genome bisulfite sequencing

Converting an epigenetic difference into a genetic difference


mC \(\xrightarrow{bisulfite}\) C


    C \(\xrightarrow{bisulfite}\) U

WGBS figure by Jacob Morrison

DNA methylation analysis

WGBS alignment

  • Bismark
  • BISCUIT

Methylation analysis

  • BSseq: read.bismark \(\to\) bsseq \(\to\) DMRSeq, DMRcate

  • biscuiteer: readBiscuit \(\to\) BSseq

  • scMET

Aligned WGBS data formats

BISCUIT

chr start end beta cov
chr1 81 82 1.000 4
chr1 92 93 1.000 4
chr1 149 150 0.400 5
chr1 234 235 0.857 7
chr1 317 318 0.889 9
chr1 318 319 0.667 3
chr1 357 358 0.900 10
chr1 358 359 0.750 4

Bismark

chr start end % M M U
chr1 81 81 100 4 0
chr1 92 92 100 4 0
chr1 149 149 40 2 3
chr1 234 234 85 6 1
chr1 317 317 88 8 1
chr1 318 318 66 2 1
chr1 357 357 90 9 1
chr1 358 358 75 3 1

Analysis

scMET for scWGBS

  • Bayesian modelling of single-cell WGBS
  • Does not provide a data input function

Analysis

scMET input format

Feature Cell total reads met reads
Feature 1 Cell 13 14 9
Feature 2 Cell 21 9 9
Feature 3 Cell 11 10 8
Feature 4 Cell 50 10 1
Feature 5 Cell 44 11 6
Feature 6 Cell 14 12 8

Analysis

BSseq

  • read.bismark to read Bismark .cov files

  • M and Coverage count matrices

  • GRanges storing CpG loci

  • Used by DMRSeq, DMRcate

An object of type 'BSseq' with
  69349 methylation loci
  93 samples
has not been smoothed
All assays are in-memory
read.bismark(files)

Analysis

BSseq

  • read.bismark to read Bismark .cov files

  • M and Coverage count matrices

  • GRanges storing CpG loci

  • Used by DMRSeq, DMRcate

An object of type 'BSseq' with
  69349 methylation loci
  93 samples
has not been smoothed
All assays are in-memory
gr <- GRanges([CpG loci])
read.bismark(files, gr)

iscream overview

iscream overview

Implementation

  • Querying and region operations written in Cpp
  • Queries made with htslib
  • Multithreaded across input files

Usage

data.frame <- region_map(
  tabixed_bedfiles_vec,
  genomic_interval_string_vec,
  fun = "aggregate", # or average
  bismark = FALSE,   # TRUE if biskmark .cov
  mval = TRUE        # FALSE for beta value
)
Bsseq <- query_all(
  tabixed_bedfiles_vec,
  genomic_interval_string_vec,
  bismark = FALSE,
  merged = TRUE,   # FALSE for uncollapsed strands
)
library(iscream)
options("iscream.threads" = 8)
files <- list.files(
  "aligner_output",
  pattern = "*.bed.gz",
  full.names = TRUE
)
regions <- (
  data.table::fread("windows.bed")
  [, paste0(V1, ":", V2, "-", V3)]
)
region_map(files, regions, fun = "aggregate")
query_all(files, regions)

Benchmarks

Runtime

CPU: Intel Xeon CPU E5-2698 v4 @ 2.20GHz

Benchmarks

Memory usage

CPU: Intel Xeon CPU E5-2698 v4 @ 2.20GHz

Benchmarks

Runtime

CPU: Intel Xeon CPU E5-2698 v4 @ 2.20GHz

Benchmarks

CPU: Intel Xeon CPU E5-2698 v4 @ 2.20GHz

Benchmarks

Tracking CpG rows with std::unordered_map vs klib

github.com/attractivechaos/klib

Future directions

  • Fast reading of full files
  • Support for converting to and reading from arrow files and datasets
  • Comprehensive data checks
  • No-copy matrix creation
  • Sparse matrix support
  • Basic analysis capabilities
  • Support for epiBED files

Summary

  • iscream is an R package to read WGBS data into analysis-ready data structures
  • Can query genomic regions to extract CpGs only from regions of interest
  • Collect all CpGs into matrices or aggregate/average them within windows
  • Queries are parallelized across files

Acknowledgements

Shen lab

Ian Beddows
Svetlana Djirackor
Ben Johnson
Kelly Kryzanowski
Paige Matusiak
Amy Nuffesse (Admin)
Mary Rhodes
Ayush Semwal
Tashi Zandstra


image/svg+xml

github.com/huishenlab/iscream

References

Hansen, K. D., Langmead, B., & Irizarry, R. A. (2012). BSmooth: From whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 13(10), R83. https://doi.org/10.1186/gb-2012-13-10-r83
Kapourani, C.-A., Argelaguet, R., Sanguinetti, G., & Vallejos, C. A. (2021). scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution. Genome Biology, 22(1), 114. https://doi.org/10.1186/s13059-021-02329-8
Korthauer, K., Chakraborty, S., Benjamini, Y., & Irizarry, R. A. (2019). Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing. Biostatistics, 20(3), 367–383. https://doi.org/10.1093/biostatistics/kxy007
Krueger, F., & Andrews, S. R. (2011). Bismark: A flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics, 27(11), 1571–1572. https://doi.org/10.1093/bioinformatics/btr167
Peters, T. J., Buckley, M. J., Chen, Y., Smyth, G. K., Goodnow, C. C., & Clark, S. J. (2021). Calling differentially methylated regions from whole genome bisulphite sequencing with DMRcate. Nucleic Acids Research, 49(19), e109–e109. https://doi.org/10.1093/nar/gkab637
Zhou, W., Johnson, B. K., Morrison, J., Beddows, I., Eapen, J., Katsman, E., Semwal, A., Habib, W. A., Heo, L., Laird, P. W., Berman, B. P., Triche, T. J., Jr., & Shen, H. (2024). BISCUIT: An efficient, standards-compliant tool suite for simultaneous genetic and epigenetic inference in bulk and single-cell studies. Nucleic Acids Research, gkae097. https://doi.org/10.1093/nar/gkae097